
Boosting

  • The boosting paradigm allows the learner to have smooth control over the tradeoff between approximation and estimation error. The learning starts with a basic class (that might have a large approximation error), and as it progresses, the class that the predictor may belong to grows richer.
  • A boosting algorithm amplifies the accuracy of weak learners. When a weak learner can be implemented efficiently, boosting provides a tool for aggregating such weak hypotheses to gradually approximate good predictors for larger, and harder to learn, classes.

We will derive Adaboost as a so-called "steepest descent" method (closely related to gradient descent). Last time, we showed that

$$\bar{g}=\underset{g: S \rightarrow \mathbb{R}}{\operatorname{argmin}}\, \mathbb{E}[\exp (-Y g(X))]$$

satisfies $\bar{g}(x)=\frac{1}{2} \log \left(\frac{1+\eta(x)}{1-\eta(x)}\right)$ and that $\operatorname{sign} \bar{g}=g_*$ is the Bayes classifier. Next, we will consider the empirical (based on the training data) analogue of this minimization problem: given a class $\mathbb{G}$ of functions (not necessarily binary classifiers), find the one that minimizes

$$g \mapsto \frac{1}{n} \sum_{j=1}^n \exp \left(-Y_j g\left(X_j\right)\right)$$

We will call the minimizer $\hat{g}_n$. The specific class of functions we will be looking at is constructed as follows: let $\mathcal{F}$ be a simple "base class" consisting of binary classifiers (e.g. the threshold classifiers or linear separators), and set

$$\mathbb{G}=\left\{\sum_{j=1}^k \alpha_j f_j: k \geq 1,\ \alpha_1, \ldots, \alpha_k \in \mathbb{R},\ f_1, \ldots, f_k \in \mathcal{F}\right\}$$

This class contains the "convex hull" of $\mathcal{F}$, the set of all convex combinations of elements of $\mathcal{F}$, which is obtained by restricting the weights to be nonnegative and to sum to 1. Another way to think about $g=\sum_{j=1}^k \alpha_j f_j$ is that the sign of $g$ is the "majority vote" among $f_1, \ldots, f_k$, where the "vote" of $f_j$ carries weight $\alpha_j$.
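To make the weighted "majority vote" picture and the empirical exponential loss concrete, here is a minimal Python sketch; the function names `majority_vote` and `empirical_exp_loss` are illustrative and not part of the notes, and base classifiers are assumed to be callables with values in $\{\pm 1\}$.

```python
import numpy as np

# A minimal sketch: the weighted majority vote sign(sum_j alpha_j f_j(x)) and the
# empirical exponential loss (1/n) sum_j exp(-Y_j g(X_j)) minimized by boosting.
# Base classifiers f_j are assumed to be callables with values in {-1, +1}.

def majority_vote(x, alphas, base_classifiers):
    """Return the sign of g(x) = sum_j alpha_j * f_j(x)."""
    s = sum(a * f(x) for a, f in zip(alphas, base_classifiers))
    return 1 if s >= 0 else -1

def empirical_exp_loss(g, X, Y):
    """Return (1/n) * sum_j exp(-Y_j * g(X_j)) for a real-valued predictor g."""
    return float(np.mean([np.exp(-y * g(x)) for x, y in zip(X, Y)]))
```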

It turns out that this problem admits an efficient solution under a rather weak assumption on the class $\mathcal{F}$, called the "weak learnability" condition.

Definition (Weak Learnability)

We will say that a class $\mathcal{F}$ of binary classifiers satisfies the weak learnability condition if for any collection of points $\left(x_j, y_j\right)_{j=1}^n \in S \times\{\pm 1\}$, $n \geq 1$, and any nonnegative weights $w_1, \ldots, w_n$ with $\sum_{j=1}^n w_j=1$, there exists $f \in \mathcal{F}$ such that $\sum_{j=1}^n w_j \mathbb{I}\left(y_j \neq f\left(x_j\right)\right) \leq 1 / 2$ (in simple terms, $f$ "does at least as well as a random guess").

Note that weak learnability is implied by symmetry, meaning that $f \in \mathcal{F} \Longleftrightarrow -f \in \mathcal{F}$; indeed, if $\sum_{j=1}^n w_j \mathbb{I}\left(y_j \neq f\left(x_j\right)\right)>\frac{1}{2}$, then $\sum_{j=1}^n w_j \mathbb{I}\left(y_j \neq -f\left(x_j\right)\right)<\frac{1}{2}$.

Adaboost

Let's examine one step of a version of the so-called "steepest descent" algorithm for minimizing $\frac{1}{n} \sum_{j=1}^n \exp \left(-Y_j g\left(X_j\right)\right)$. Steepest descent is essentially a version of the gradient descent method where we use an "approximate" gradient at each step instead of the true gradient.

Assume that $g \in \mathbb{G}$ is our current guess. We will look for $\alpha \in \mathbb{R}$ and $f \in \mathcal{F}$ that (approximately) minimize

$$\frac{1}{n} \sum_{j=1}^n e^{-Y_j\left[g\left(X_j\right)+\alpha f\left(X_j\right)\right]}=\sum_{j=1}^n \frac{1}{n} e^{-Y_j g\left(X_j\right)} e^{-\alpha f\left(X_j\right) Y_j} .$$

Intuitively, such an $f$ (the direction of "steepest descent") can be seen as an approximate gradient. Define the weights $w_j=\frac{1}{n} e^{-Y_j g\left(X_j\right)}$, $j=1, \ldots, n$, and note that $w_j \geq 0$. Let $\tilde{w}_j=\frac{w_j}{\sum_{i=1}^n w_i}$, so that $\sum_{j=1}^n \tilde{w}_j=1$. Our problem is then to minimize $\sum_{j=1}^n \tilde{w}_j e^{-\alpha f\left(X_j\right) Y_j}$ over $f \in \mathcal{F}$ and $\alpha \in \mathbb{R}$. Since $f$ takes only the two values $\pm 1$, we have that

$$\begin{aligned} \sum_{j=1}^n \tilde{w}_j e^{-\alpha f\left(X_j\right) Y_j} &=\sum_{j=1}^n \tilde{w}_j e^{-\alpha} \mathbb{I}\left(Y_j=f\left(X_j\right)\right)+\sum_{j=1}^n \tilde{w}_j e^\alpha \mathbb{I}\left(Y_j \neq f\left(X_j\right)\right) \pm \sum_{j=1}^n \tilde{w}_j e^{-\alpha} \mathbb{I}\left(Y_j \neq f\left(X_j\right)\right) \\ &=e^{-\alpha}+\left(e^\alpha-e^{-\alpha}\right) \sum_{j=1}^n \tilde{w}_j \mathbb{I}\left(Y_j \neq f\left(X_j\right)\right), \end{aligned}$$

where $e_{n, \tilde{w}}(f)=\sum_{j=1}^n \tilde{w}_j \mathbb{I}\left(Y_j \neq f\left(X_j\right)\right)$ is the "weighted" training error. To minimize the resulting expression, we proceed in two steps:

  1. Minimize $\sum_{j=1}^n \tilde{w}_j \mathbb{I}\left(Y_j \neq f\left(X_j\right)\right)$ with respect to $f$.
  2. Minimize $e^{-\alpha}+\left(e^\alpha-e^{-\alpha}\right) e_{n, \tilde{w}}(f)$ with respect to $\alpha$.

The "weak learnability" assumption is needed to complete step 1: recall that for any nonnegative weights $\tilde{w}_1, \ldots, \tilde{w}_n$ with $\sum_{j=1}^n \tilde{w}_j=1$, there exists $f \in \mathcal{F}$ such that $e_{n, \tilde{w}}(f) \leq \frac{1}{2}$. We will assume access to a "black box" weak learning algorithm that takes $\tilde{w}_1, \ldots, \tilde{w}_n$ and $\left(X_1, Y_1\right), \ldots,\left(X_n, Y_n\right)$ as inputs and outputs some $f \in \mathcal{F}$ such that $e_{n, \tilde{w}}(f) \leq \frac{1}{2}$.

Exercise. Describe such a "weak learning" algorithm for the class consisting of

  1. threshold classifiers $g_t^{+}$ and $g_t^{-}$ (a sketch of one possible answer appears below);
  2. linear separators (e.g., you could rely on logistic regression).

Assuming that $e_{n, \tilde{w}}(f) \leq \frac{1}{2}$, the minimum of $e^{-\alpha}+\left(e^\alpha-e^{-\alpha}\right) e_{n, \tilde{w}}(f)$ over $\alpha$ (found by setting the derivative with respect to $\alpha$ to zero) occurs at

$$\hat{\alpha}=\frac{1}{2} \log \frac{1-e_{n, \tilde{w}}(f)}{e_{n, \tilde{w}}(f)} \geq 0 .$$
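The following is a hedged sketch of one possible answer to part 1 of the exercise: a brute-force weak learner for one-dimensional threshold classifiers. It assumes $g_t^{+}(x)=\operatorname{sign}(x-t)$ and $g_t^{-}(x)=-\operatorname{sign}(x-t)$, and the function name `threshold_weak_learner` is illustrative, not taken from the notes. By the symmetry remark above, the better of the two orientations always has weighted error at most $\frac{1}{2}$.

```python
import numpy as np

# A possible brute-force weak learner for one-dimensional threshold classifiers
# (a sketch; it assumes g_t^+(x) = sign(x - t) and g_t^-(x) = -sign(x - t)).
# For each threshold the two orientations have weighted errors summing to 1,
# so the better of the two always has weighted error at most 1/2.

def threshold_weak_learner(x, y, w):
    """x: (n,) real features, y: (n,) labels in {-1,+1}, w: (n,) nonnegative
    weights summing to 1.  Returns the threshold classifier (as a callable)
    minimizing the weighted training error e_{n,w}(f)."""
    xs = np.sort(np.unique(x))
    # candidate thresholds: midpoints between distinct sample points, plus the extremes
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    best_t, best_s, best_err = None, None, np.inf
    for t in candidates:
        for s in (+1, -1):                       # orientation: g_t^+ or g_t^-
            pred = np.where(x - t >= 0, s, -s)
            err = np.sum(w * (pred != y))        # weighted error e_{n,w}(f)
            if err < best_err:
                best_t, best_s, best_err = t, s, err

    def f(xi):                                   # the selected threshold classifier
        return best_s if xi - best_t >= 0 else -best_s
    return f
```

The search over candidate thresholds costs $O(n^2)$ here; a single sorted sweep would reduce it to $O(n \log n)$, but this version suffices to illustrate the interface the boosting loop expects.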
Adaboost is an algorithm that repeats the steps outlined above. We present it now.

Adaboost algorithm:

Initialize $w_j^{(1)}=\frac{1}{n}$, $j=1, \ldots, n$. For $t=1, \ldots, T$ do

  • Call the weak learner (WL), which outputs $f_t$ such that $e_{n, w^{(t)}}\left(f_t\right) \leq \frac{1}{2}$;
  • Set $\alpha_t=\frac{1}{2} \log \frac{1-e_{n, w^{(t)}}\left(f_t\right)}{e_{n, w^{(t)}}\left(f_t\right)}$ (note that $\alpha_t \geq 0$);
  • Update the weights $w_j^{(t+1)}=\frac{w_j^{(t)} \exp \left(-Y_j \alpha_t f_t\left(X_j\right)\right)}{Z_t}$, $j=1, \ldots, n$, where $Z_t=\sum_{j=1}^n w_j^{(t)} \exp \left(-Y_j \alpha_t f_t\left(X_j\right)\right)$ is the "normalizing factor."
  • Output: $\widehat{g}_T(\cdot)=\operatorname{sign}\left(\sum_{t=1}^T \alpha_t f_t(\cdot)\right)$.
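A minimal Python sketch of this loop is given below. It assumes a black-box `weak_learner(X, y, w)` that returns a base classifier (a callable with values in $\{\pm 1\}$) whose weighted error is at most $\frac{1}{2}$, for example the threshold learner sketched earlier; the function names are illustrative and not part of the notes.

```python
import numpy as np

# A minimal sketch of the Adaboost loop above.  `weak_learner(X, y, w)` is an assumed
# black box returning a base classifier f (a callable with values in {-1, +1}) whose
# weighted error e_{n,w}(f) is at most 1/2, e.g. the threshold learner sketched earlier.

def adaboost(X, y, weak_learner, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initialize w_j^{(1)} = 1/n
    alphas, classifiers, errors = [], [], []
    for _ in range(T):
        f = weak_learner(X, y, w)                  # call the weak learner (WL)
        pred = np.array([f(x) for x in X])
        err = max(np.sum(w * (pred != y)), 1e-12)  # e_{n,w^{(t)}}(f_t), kept away from 0
        alpha = 0.5 * np.log((1 - err) / err)      # alpha_t = (1/2) log((1 - e) / e)
        w = w * np.exp(-y * alpha * pred)          # numerator of the weight update
        w = w / w.sum()                            # divide by the normalizing factor Z_t
        alphas.append(alpha)
        classifiers.append(f)
        errors.append(err)

    def g_hat(x):                                  # hat{g}_T(x) = sign(sum_t alpha_t f_t(x))
        s = sum(a * f(x) for a, f in zip(alphas, classifiers))
        return 1 if s >= 0 else -1

    return g_hat, errors
```

For example, `g_hat, errors = adaboost(x, y, threshold_weak_learner, T=100)` on a one-dimensional sample fits the boosted classifier, and comparing its training error with $\exp(-2T\gamma^2)$ for $\gamma=\min_t\left(\tfrac{1}{2} - e_{n, w^{(t)}}(f_t)\right)$ (computable from `errors`) gives an empirical check of the theorem below.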

Theorem

Assume that at each step, WL outputs $f_t$ such that

$$e_{n, w^{(t)}}\left(f_t\right)=\sum_{j=1}^n w_j^{(t)} \mathbb{I}\left\{Y_j \neq f_t\left(X_j\right)\right\} \leq \frac{1}{2}-\gamma$$

for some $\gamma>0$. Then the training error corresponding to the classifier $\widehat{g}_T$ satisfies

$$\frac{1}{n} \sum_{j=1}^n \mathbb{I}\left\{Y_j \neq \widehat{g}_T\left(X_j\right)\right\} \leq \exp \left(-2 T \gamma^2\right) .$$
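For instance, if the weak learner always beats random guessing by a margin of $\gamma=0.1$, the bound equals $e^{-0.02\,T}$, which drops below $1\%$ after $T=250$ rounds; more generally, since the training error only takes values in $\{0, \frac{1}{n}, \frac{2}{n}, \ldots\}$, it is exactly zero as soon as $T > \frac{\ln n}{2\gamma^2}$.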

Proof

a) Note that $w_j^{(T+1)}=\frac{1}{n} \frac{e^{-Y_j \sum_{t=1}^T \alpha_t f_t\left(X_j\right)}}{\prod_{t=1}^T Z_t}$; this is easy to show by induction, and the details will be filled in during the lecture.

b) We have that

$$\begin{aligned} \frac{1}{n} \sum_{j=1}^n \mathbb{I}\left\{Y_j \neq \widehat{g}_T\left(X_j\right)\right\} &=\frac{1}{n} \sum_{j=1}^n \mathbb{I}\left\{Y_j \sum_{t=1}^T \alpha_t f_t\left(X_j\right) \leq 0\right\} \\ & \leq \frac{1}{n} \sum_{j=1}^n e^{-Y_j \sum_{t=1}^T \alpha_t f_t\left(X_j\right)} \\ &=\frac{1}{n} \sum_{j=1}^n w_j^{(T+1)}\, n \prod_{t=1}^T Z_t \\ &=\prod_{t=1}^T Z_t, \end{aligned}$$

where the last equality uses the fact that the weights $w_j^{(T+1)}$ sum to 1.

c) For $Z_t$ at each step,

$$\begin{aligned} Z_t &=\sum_{j=1}^n w_j^{(t)} \exp \left(-Y_j \alpha_t f_t\left(X_j\right)\right) \\ &=\sum_{j=1}^n w_j^{(t)} \mathbb{I}\left\{Y_j=f_t\left(X_j\right)\right\} e^{-\alpha_t}+\sum_{j=1}^n w_j^{(t)} \mathbb{I}\left\{Y_j \neq f_t\left(X_j\right)\right\} e^{\alpha_t} \pm \sum_{j=1}^n w_j^{(t)} \mathbb{I}\left\{Y_j \neq f_t\left(X_j\right)\right\} e^{-\alpha_t} \\ &=e^{-\alpha_t}+\left(e^{\alpha_t}-e^{-\alpha_t}\right) \sum_{j=1}^n w_j^{(t)} \mathbb{I}\left\{Y_j \neq f_t\left(X_j\right)\right\}, \end{aligned}$$

where the last sum is $e_{n, w^{(t)}}\left(f_t\right)$. Recalling that $\alpha_t=\frac{1}{2} \log \left(\frac{1-e_{n, w^{(t)}}\left(f_t\right)}{e_{n, w^{(t)}}\left(f_t\right)}\right)$, so that $e^{\alpha_t}=\sqrt{\frac{1-e_{n, w^{(t)}}\left(f_t\right)}{e_{n, w^{(t)}}\left(f_t\right)}}$, we obtain

$$Z_t=2 \sqrt{e_{n, w^{(t)}}\left(f_t\right)\left(1-e_{n, w^{(t)}}\left(f_t\right)\right)} .$$

d) The function $x \mapsto x(1-x)$ is increasing on $\left[0, \frac{1}{2}\right]$, so on $\left[0, \frac{1}{2}-\gamma\right]$ it is maximized at $x=\frac{1}{2}-\gamma$. Since $e_{n, w^{(t)}}\left(f_t\right) \leq \frac{1}{2}-\gamma$, it follows that

$$Z_t \leq 2 \sqrt{(1 / 2-\gamma)(1 / 2+\gamma)}=\sqrt{1-4 \gamma^2} \leq \sqrt{e^{-4 \gamma^2}}=e^{-2 \gamma^2},$$

since $1-x \leq e^{-x}$ for $x \in[0,1]$. Therefore,

$$\frac{1}{n} \sum_{j=1}^n \mathbb{I}\left\{Y_j \neq \widehat{g}_T\left(X_j\right)\right\} \leq \prod_{t=1}^T Z_t \leq \exp \left(-2 T \gamma^2\right) .$$

In conclusion, the training error goes to 0 exponentially fast. However, the main object of interest is the "generalization error"

$$P\left(Y \widehat{g}_T(X) \leq 0\right).$$

Fortunately, in this case the generalization error is also going to be small: if the VC dimension of the base class $\mathcal{F}$ is $V$, then with high probability the generalization error exceeds the training error by at most $K \sqrt{\frac{V}{n}}$, where $K$ is some fixed constant. The proof of this fact is rather difficult, however, and won't be covered in this course.